# Multimodal Embedding

## UniME Phi3.5-V 4.2B
- Organization: DeepGlint-AI · License: MIT
- Tags: Multimodal Alignment · Transformers · English
- Downloads: 54 · Likes: 4

UniME is a general-purpose embedding model built on a multimodal large language model, focused on breaking down modality barriers to enable cross-modal retrieval and embedding learning.

## OmniEmbed v0.1
- Organization: Tevatron · License: MIT
- Tags: Multimodal Fusion
- Downloads: 2,190 · Likes: 3

A multimodal embedding model based on Qwen2.5-Omni-7B, producing unified embedding representations for cross-lingual text, images, audio, and video.

## Nomic Embed Multimodal 3B
- Organization: nomic-ai
- Tags: Text-to-Image · Supports Multiple Languages
- Downloads: 3,431 · Likes: 11

Nomic Embed Multimodal 3B is a state-of-the-art multimodal embedding model focused on visual document retrieval. It encodes text and images into a unified space and reaches 58.8 NDCG@5 on the Vidore-v2 benchmark.

## ColNomic Embed Multimodal 3B
- Organization: nomic-ai
- Tags: Multimodal Fusion · Supports Multiple Languages
- Downloads: 4,636 · Likes: 17

ColNomic Embed Multimodal 3B is a 3-billion-parameter multimodal embedding model designed specifically for visual document retrieval, supporting unified encoding of multilingual text and images.

## FinSeer
- Organization: TheFinAI
- Tags: Large Language Model · Transformers · English
- Downloads: 13 · Likes: 1

The first retriever designed specifically for financial time-series forecasting, built on the retrieval-augmented generation (RAG) framework.

## NitiBench CCL Human-Finetuned BGE-M3
- Organization: VISAI-AI · License: MIT
- Tags: Text Embedding · Other
- Downloads: 51 · Likes: 1

A version of BAAI/bge-m3 fine-tuned on Thai legal query data, supporting dense retrieval, lexical matching, and multi-vector interaction.

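Because this checkpoint is a fine-tune of BAAI/bge-m3, all three retrieval modes can be exercised through FlagEmbedding's `BGEM3FlagModel`. A minimal sketch, using the base model id from the description (swap in the fine-tuned checkpoint id from this card's repository); the query and document strings are invented for illustration:

```python
# Sketch: exercising all three BGE-M3 retrieval modes with FlagEmbedding.
# "BAAI/bge-m3" is the base model cited in the description; the example
# strings below are placeholders.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

queries = ["What does Thai law say about contract termination?"]
docs = ["A contract may be terminated by notice where the agreement so provides..."]

q = model.encode(queries, return_dense=True, return_sparse=True, return_colbert_vecs=True)
d = model.encode(docs, return_dense=True, return_sparse=True, return_colbert_vecs=True)

# Dense retrieval: similarity of pooled sentence vectors.
dense_score = q["dense_vecs"] @ d["dense_vecs"].T

# Lexical matching: overlap of learned per-token weights.
lexical_score = model.compute_lexical_matching_score(q["lexical_weights"][0], d["lexical_weights"][0])

# Multi-vector (ColBERT-style) late interaction.
colbert_score = model.colbert_score(q["colbert_vecs"][0], d["colbert_vecs"][0])

print(dense_score, lexical_score, colbert_score)
```
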
## LLaVE 7B
- Organization: zhibinlan · License: Apache-2.0
- Tags: Multimodal Fusion · Transformers · English
- Downloads: 1,389 · Likes: 5

LLaVE-7B is a 7-billion-parameter multimodal embedding model based on LLaVA-OneVision-7B that produces embedding representations for text, images, multiple images, and videos.

## LLaVE 2B
- Organization: zhibinlan · License: Apache-2.0
- Tags: Text-to-Image · Transformers · English
- Downloads: 20.05k · Likes: 45

LLaVE-2B is a 2-billion-parameter multimodal embedding model based on Aquila-VL-2B, featuring a 4K-token context window and supporting embeddings for text, images, multiple images, and videos.

## LLaVE 0.5B
- Organization: zhibinlan · License: Apache-2.0
- Tags: Multimodal Fusion · Transformers · English
- Downloads: 2,897 · Likes: 7

LLaVE-0.5B is a 0.5-billion-parameter multimodal embedding model based on LLaVA-OneVision-0.5B that embeds text, images, multiple images, and videos.

## ViT Base Patch16 SigLIP 512 (webli)
- Organization: timm · License: Apache-2.0
- Tags: Image Classification · Transformers
- Downloads: 702 · Likes: 0

A Vision Transformer following the SigLIP recipe, containing only the image encoder and using the original attention-pooling head.

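Since the checkpoint contains only the image tower, it is typically used as a feature extractor through timm. A minimal sketch, assuming the timm identifier `vit_base_patch16_siglip_512.webli` (matching this card's name) and a placeholder image file:

```python
# Sketch: pooled image embeddings from the SigLIP image tower via timm.
# Assumptions: the timm identifier "vit_base_patch16_siglip_512.webli" and a
# local placeholder image "photo.jpg".
import timm
import torch
from PIL import Image

model = timm.create_model("vit_base_patch16_siglip_512.webli", pretrained=True, num_classes=0)
model.eval()

cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

image = Image.open("photo.jpg").convert("RGB")
with torch.no_grad():
    features = model(transform(image).unsqueeze(0))  # (1, embed_dim) pooled features

print(features.shape)
```
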
## DSE Qwen2 2B MRL V1
- Organization: MrLight · License: Apache-2.0
- Tags: Multimodal Fusion · Supports Multiple Languages
- Downloads: 4,447 · Likes: 56

DSE-QWen2-2b-MRL-V1 is a dual-encoder model designed to encode document screenshots into dense vectors for document retrieval.

## BGE-M3 GGUF
- Organization: lm-kit · License: MIT
- Tags: Text Embedding
- Downloads: 2,885 · Likes: 10

A GGUF-quantized build of the bge-m3 embedding model for efficient text embedding.

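A minimal sketch of computing embeddings from such a GGUF file with llama-cpp-python, one common runtime for GGUF models; the quantization file name below is a placeholder for whichever variant is downloaded:

```python
# Sketch: text embeddings from a bge-m3 GGUF file using llama-cpp-python.
# "bge-m3-Q4_K_M.gguf" is a placeholder file name.
from llama_cpp import Llama

model = Llama(model_path="bge-m3-Q4_K_M.gguf", embedding=True, n_ctx=8192)

vector = model.embed("A quick test sentence for the embedding model.")
print(len(vector))  # dense dimensionality (1024 for bge-m3)
```
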
## Nomic Embed Vision v1.5
- Organization: nomic-ai · License: Apache-2.0
- Tags: Text-to-Image · Transformers · English
- Downloads: 27.85k · Likes: 161

A high-performance vision embedding model that shares its embedding space with nomic-embed-text-v1.5, enabling multimodal (text-to-image) applications.

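Because the vision and text models share one embedding space, text-to-image retrieval reduces to a dot product between the two encoders' outputs. A minimal sketch, assuming the usage documented on the Nomic model cards (CLS pooling on the vision side, a `search_query:` prefix on the text side) and a placeholder image path:

```python
# Sketch: text-to-image retrieval across the shared nomic embedding space.
# Assumptions: CLS-token pooling for the vision encoder and the "search_query: "
# prefix for text queries, per the Nomic model cards; "photo.jpg" is a placeholder.
import torch
import torch.nn.functional as F
from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
vision_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)

image = Image.open("photo.jpg")
inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    img_emb = vision_model(**inputs).last_hidden_state
img_embeddings = F.normalize(img_emb[:, 0], p=2, dim=1)  # CLS vector, L2-normalized

text_model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
text_embeddings = text_model.encode(
    ["search_query: a photo of a dog"],
    convert_to_tensor=True,
    normalize_embeddings=True,
)

# Cosine similarity between the query and the image (both unit-normalized).
print(text_embeddings @ img_embeddings.T)
```
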
## Nomic Embed Vision v1
- Organization: nomic-ai · License: Apache-2.0
- Tags: Text-to-Image · Transformers · English
- Downloads: 2,032 · Likes: 22

A high-performance vision embedding model that shares its embedding space with nomic-embed-text-v1, enabling multimodal applications.

## BGE-M3 ONNX
- Organization: aapot · License: MIT
- Tags: Text Embedding · Transformers
- Downloads: 292 · Likes: 29

BGE-M3 is an embedding model that supports dense retrieval, lexical matching, and multi-vector interaction, here converted to ONNX format for use with runtimes such as ONNX Runtime.

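A rough sketch of running such an export directly with ONNX Runtime. The local file name and the output layout (token-level hidden states, with the dense embedding taken as the normalized CLS vector) are assumptions; the actual export may expose differently named or pre-pooled outputs, so check the repository's usage notes.

```python
# Sketch: dense embeddings from a local BGE-M3 ONNX export via onnxruntime.
# Assumptions: the export is saved as "bge-m3-onnx/model.onnx" and its first
# output holds token-level hidden states of shape (batch, seq_len, dim).
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
session = ort.InferenceSession("bge-m3-onnx/model.onnx", providers=["CPUExecutionProvider"])

enc = tokenizer(["What is BGE-M3?"], padding=True, return_tensors="np")
input_names = {i.name for i in session.get_inputs()}
outputs = session.run(None, {k: v for k, v in enc.items() if k in input_names})

hidden = outputs[0]                                  # assumed (batch, seq_len, dim)
dense = hidden[:, 0]                                 # CLS pooling, as in the PyTorch model
dense = dense / np.linalg.norm(dense, axis=1, keepdims=True)
print(dense.shape)
```
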
## SigLIP Base Patch16 224
- Organization: Xenova
- Tags: Text-to-Image · Transformers
- Downloads: 182 · Likes: 1

SigLIP is a vision-language pre-trained model suited to zero-shot image classification.

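For zero-shot classification in Python, the simplest route is the transformers pipeline against the upstream PyTorch checkpoint (the Xenova repository is a converted copy of the same weights for in-browser transformers.js use). A minimal sketch with a placeholder image path:

```python
# Sketch: zero-shot image classification with a SigLIP checkpoint.
# "google/siglip-base-patch16-224" is the upstream PyTorch model; "photo.jpg"
# is a placeholder image path (a URL also works).
from transformers import pipeline

classifier = pipeline("zero-shot-image-classification", model="google/siglip-base-patch16-224")
print(classifier("photo.jpg", candidate_labels=["a cat", "a dog", "a car"]))
```
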
## CLIP ViT Base Patch16
- Organization: Xenova
- Tags: Text-to-Image · Transformers
- Downloads: 32.99k · Likes: 9

OpenAI's open-source CLIP model with a Vision Transformer image encoder, supporting cross-modal understanding of images and text.

## Chinese CLIP ViT Base Patch16
- Organization: OFA-Sys
- Tags: Text-to-Image · Transformers
- Downloads: 49.02k · Likes: 104

The base version of Chinese CLIP, using ViT-B/16 as the image encoder and RoBERTa-wwm-base as the text encoder, trained on roughly 200 million Chinese image-text pairs.

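transformers ships dedicated Chinese-CLIP classes, so image-text matching follows the usual CLIP pattern. A minimal sketch with a placeholder image and candidate captions:

```python
# Sketch: Chinese image-text matching with the base Chinese-CLIP checkpoint.
# "cat.jpg" and the candidate captions are placeholders.
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-base-patch16"
model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)

image = Image.open("cat.jpg")
texts = ["一只猫", "一只狗", "一辆汽车"]  # "a cat", "a dog", "a car"

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity scores, softmaxed into label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```
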